Textual Inversion

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

https://gyazo.com/34f58a71480882f93af1cfe2c457d5d6

Text-to-image models offer unprecedented freedom to guide creation through natural language.

Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes.

In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy?

Here we present a simple approach that allows such creative freedom. Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new "words" in the embedding space of a frozen text-to-image model.

These "words" can be composed into natural language sentences, guiding personalized creation in an intuitive way.

Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts.

We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks.

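In text-to-image models, the input prompt is converted into numbers.

Specifically, each word is converted into a token (a token is, in effect, an entry in the model's vocabulary, roughly a dictionary word), and each token is then mapped to an embedding vector.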
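As a rough illustration of the prompt → token → vector step, here is a minimal sketch. The vocabulary, the token ids, and the 4-dimensional vectors are all made up for this note; a real model uses a learned tokenizer and much larger vectors.

```python
# Toy illustration only: vocabulary, ids, and vectors are hypothetical.
vocab = {"a": 0, "photo": 1, "of": 2, "cat": 3}

# One embedding vector per token id (row index = token id).
embedding_table = [
    [0.1, 0.0, 0.2, 0.0],  # "a"
    [0.0, 0.3, 0.0, 0.1],  # "photo"
    [0.2, 0.1, 0.0, 0.0],  # "of"
    [0.5, 0.4, 0.3, 0.2],  # "cat"
]


def tokenize(prompt):
    """Map each whitespace-separated word to its token id."""
    return [vocab[word] for word in prompt.lower().split()]


def embed(token_ids):
    """Look up one embedding vector per token id."""
    return [embedding_table[i] for i in token_ids]


token_ids = tokenize("a photo of cat")
vectors = embed(token_ids)
print(token_ids)     # [0, 1, 2, 3]
print(len(vectors))  # 4
```

Textual Inversion leaves this lookup table untouched for existing words and only adds a vector for a new placeholder word.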

Textual Inversion performs the inverse of this conversion.

Given a few images as input, it outputs an "embedding" (a continuous vector representation for the specific token) that represents their visual concept.
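The "inversion" can be pictured as optimizing a single vector while everything else stays frozen. The sketch below is a toy stand-in, not the paper's implementation: the real method optimizes the new embedding against the diffusion model's training loss on the user's images, whereas here a fixed linear map `FROZEN_W` (hypothetical) plays the role of the frozen model and `target` plays the role of the images.

```python
# Toy stand-in: a frozen linear map replaces the frozen text-to-image model.
FROZEN_W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # never updated


def frozen_model(v):
    """Apply the frozen map to an embedding vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in FROZEN_W]


def invert(target, steps=200, lr=0.1):
    """Gradient descent on the embedding only; the model stays fixed."""
    v = [0.0, 0.0]  # the new "word" embedding, the only trainable thing
    for _ in range(steps):
        out = frozen_model(v)
        residual = [o - t for o, t in zip(out, target)]
        # Gradient of the squared error with respect to v.
        grad = [
            2.0 * sum(residual[i] * FROZEN_W[i][j] for i in range(len(FROZEN_W)))
            for j in range(len(v))
        ]
        v = [x - lr * g for x, g in zip(v, grad)]
    return v


concept_vector = [0.7, -0.3]           # hypothetical "true" concept
target = frozen_model(concept_vector)  # what the "images" look like
learned = invert(target)
print([round(x, 4) for x in learned])
```

Only `v` changes during the loop, which mirrors the idea that the model's weights are untouched and the learned concept lives entirely in one embedding.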

2208.01618 An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal (1, 2), Yuval Alaluf (1), Yuval Atzmon (2), Or Patashnik (1), Amit H. Bermano (1), Gal Chechik (2), Daniel Cohen-Or (1)

(1) Tel Aviv University, (2) NVIDIA

Submitted on 2 Aug 2022

@alfredplpl: https://t.co/iMK82nnoPZ

https://pbs.twimg.com/media/FdgAHQEUUAA8WK1.jpg

/nishio/Textual Inversionを試してみる (Trying out Textual Inversion)

The embedding file that Textual Inversion produces is only about 5 KB. Its content is mainly a 768-dimensional vector of floats, with a small amount of token-related metadata attached.
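A quick size check, assuming the floats are stored as 32-bit values (an assumption; the note does not state the precision): the vector itself accounts for about 3 KB, so the rest of the roughly 5 KB would be metadata and serialization overhead.

```python
import struct

EMBEDDING_DIM = 768  # dimension stated in the note

# Assumption: each component is a 4-byte float32.
vector = [0.0] * EMBEDDING_DIM
payload = struct.pack(f"<{EMBEDDING_DIM}f", *vector)

print(len(payload))  # 3072
```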

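Since the result of training is a single 768-dimensional vector, it might be possible to search efficiently by training several vectors, keeping only the good ones, and averaging them or applying a genetic algorithm (GA) to them.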
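The "pick the good ones and average" idea could look like the sketch below. This is untested speculation from the note, not an established recipe: the `scores` are hypothetical quality ratings (for example, from judging generated images by eye), and short 3-dimensional vectors stand in for the 768-dimensional ones.

```python
def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def select_and_average(candidates, scores, keep):
    """Keep the `keep` highest-scoring candidates and average them."""
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    best = [vec for _, vec in ranked[:keep]]
    return average_vectors(best)


# Three candidate embeddings from separate training runs (toy values).
candidates = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [9.0, 9.0, 9.0]]
scores = [0.9, 0.8, 0.1]  # hypothetical quality ratings

merged = select_and_average(candidates, scores, keep=2)
print(merged)  # [2.0, 1.0, 1.0]
```

A GA would repeat this kind of selection step and add crossover and mutation between rounds; whether averaged embeddings still produce good images would need to be checked empirically.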

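1 token (in most cases) = 1 word of the prompt

Can Stable Diffusion be made to draw your own favorite character?

In underground communities, it is used in erotic image generation to imitate a specific artist's style.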

Textual inversion embeddings